Machine Learning
Table of Contents
- 1. Paradigms
- 2. Perception Space and Concept Space.
- 3. World Model
- 4. Latent Space
- 5. Loss Function
- 6. Regression
- 7. Variational Inference
- 8. Gradient Descent/Ascent
- 9. Hopfield Network
- 10. Sparse Regression
- 11. Neural Network
- 12. Reinforcement Learning
- 12.1. Markov Decision Process
- 12.2. Generalized Policy Iteration
- 12.3. Temporal Difference Control
- 12.4. Function Approximation
- 12.5. Policy Gradient Methods
- 13. Interpretation
- 14. History
- 15. Products
- 16. Tools
- 17. Applications
- 18. Reference
1. Paradigms
1.1. Supervised Learning
- Trained using labeled data
1.2. Unsupervised Learning
- Unlabeled data
1.3. Reinforcement Learning
- Use rewards
2. Perception Space and Concept Space.
A layer of neurons is like classifying data with a bunch of lines (hyperplanes). Multiple layers correspond to cumulatively folding the paper.
3. World Model
A world model was given to chess engines in the 1970s. Then AlphaGo Zero built its own world model from its experience. An LLM's world model is general enough that it applies to individual cases.
The world model can be used to make the model dream. If the search space is too large, the model uses intuition given by a neural network and rolls out random possible sequences of moves.
The recent GPT-4 applies this rollout to thoughts within a tree of thoughts, to find the optimal response.
4. Latent Space
The embedding space.
4.1. Visualization
- Latent Space Visualisation: PCA, t-SNE, UMAP | Deep Learning Animated - YouTube
- Principal Component Analysis
- Fast but not extensible.
- t-SNE
- Maximize the probability that the low-dimensional representation occurs, where pairwise similarities are given by a Gaussian distribution around each point.
- UMAP
- Find the k nearest neighbors, and match the weights of the low-dimensional adjacency graph with those of the high-dimensional one.
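A minimal sketch of the first two methods, assuming scikit-learn and its built-in digits dataset; the dataset and hyperparameters are illustrative only (UMAP needs the separate umap-learn package and is omitted here).
```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)       # 64-dimensional digit images

# PCA: fast linear projection onto the top two principal components.
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: matches pairwise similarities between the high- and low-dimensional
# representations; much slower and non-parametric.
X_tsne = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X)

print(X_pca.shape, X_tsne.shape)          # (1797, 2) (1797, 2)
```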
5. Loss Function
5.1. Functions
5.1.1. Cross-Entropy Loss
- Calculate the cross entropy of the prediction with respect to the values that are expected.
5.1.1.1. Advantages
- Works well when the model outputs are probabilities
- Good for imbalanced classes
- Offers smoother gradients than other loss functions
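A minimal sketch of the cross-entropy computation for probabilistic class predictions; the helper name and the clipping constant are illustrative assumptions.
```python
import numpy as np

def cross_entropy(p_pred, y_true, eps=1e-12):
    """Cross-entropy of predicted class probabilities w.r.t. integer labels.

    p_pred : (n_samples, n_classes), rows summing to 1
    y_true : (n_samples,) integer class labels
    """
    p = np.clip(p_pred, eps, 1.0)                 # avoid log(0)
    return -np.mean(np.log(p[np.arange(len(y_true)), y_true]))

# Confident correct prediction -> small loss; confident wrong -> large loss.
print(cross_entropy(np.array([[0.9, 0.1]]), np.array([0])))  # ~0.105
print(cross_entropy(np.array([[0.1, 0.9]]), np.array([0])))  # ~2.303
```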
6. Regression
6.1. Linear Regression
7. Variational Inference
The goal is to approximate the probability distribution \( P(x) \). We postulate the existence of a latent factor \( z \) that affects the probability of \( x \), given as \( P(x\mid z) \). The latent factor also has a probability distribution \( P(z) \), factoring the distribution of \( x \): \[ P(x) = \int P(x\mid z)P(z)\,dz. \]
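A standard next step, not spelled out above: introduce an approximate distribution \( q(z) \) over the latent factor and apply Jensen's inequality, which gives the evidence lower bound (ELBO) that variational inference maximizes:
\begin{align*} \log P(x) &= \log \int P(x\mid z)P(z)\,dz = \log \mathrm{E}_{q(z)}\!\left[\frac{P(x\mid z)\,P(z)}{q(z)}\right]\\ &\ge \mathrm{E}_{q(z)}\left[\log P(x\mid z)\right] - \mathrm{KL}\!\left(q(z)\,\|\,P(z)\right). \end{align*}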
8. Gradient Descent/Ascent
8.1. Learning Rate
- The coefficient of the gradient.
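A minimal gradient-descent sketch; the objective function, starting point, and learning rate are illustrative assumptions.
```python
import numpy as np

# Minimize f(x, y) = (x - 3)^2 + (y + 1)^2 by following the negative gradient.
def grad_f(p):
    x, y = p
    return np.array([2 * (x - 3), 2 * (y + 1)])

p = np.zeros(2)
lr = 0.1                        # learning rate: coefficient of the gradient
for _ in range(200):
    p -= lr * grad_f(p)         # descent; use `+=` for gradient ascent

print(p)                        # close to [3, -1]
```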
9. Hopfield Network
- A Brain-Inspired Algorithm For Memory - YouTube
- Nodes with symmetric weights between each pair of them. The weight
signifies the correlation of two nodes.
- Model of the Hebbian theory.
- Once a network learns the appropriate weights, a stored pattern can be recovered
by sequential improvement:
- updating each node to the most probable value at each sweep. Convergence is guaranteed,
- given that the interference between stored patterns is negligible.
- Multiple patterns can be stored at the same time by averaging.
- \(n_\text{patterns} \approx 0.14\cdot N_\text{neurons}\)
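A minimal Hopfield-network sketch with Hebbian storage and asynchronous update sweeps; the pattern count, corruption level, and number of sweeps are illustrative.
```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                      # neurons
patterns = rng.choice([-1, 1], size=(5, N))  # 5 << 0.14 * N stored patterns

# Hebbian storage: averaged outer products, symmetric weights, zero diagonal.
W = (patterns.T @ patterns) / N
np.fill_diagonal(W, 0)

# Recall from a corrupted cue by asynchronous sweeps.
state = patterns[0].copy()
flip = rng.choice(N, size=15, replace=False)
state[flip] *= -1                            # corrupt 15 bits
for _ in range(10):                          # sweeps
    for i in rng.permutation(N):             # update one node at a time
        state[i] = 1 if W[i] @ state >= 0 else -1

print(np.mean(state == patterns[0]))         # typically 1.0 (pattern recovered)
```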
10. Sparse Regression
10.1. Loss Function
The goal is to minimize both the error of the approximation function and its complexity. This is achieved with a loss function that is a linear combination of two functions: \[ L = L_{\mathrm{MSE}} + \alpha L_{\mathrm{count}}. \]
The \( L_{\mathrm{count}} \) term can take various forms, something similar to \[ L_{\mathrm{count}} = \sum_i|w_i|^p. \] When \( p \) is close to zero, this approximates the count of nonzero coefficients (the \(\ell_0\) "norm"). But often \( p=1 \) is used, which still produces reasonable results and allows simpler calculation via convex optimization.
\( w_{i} \) are the coefficients of the predetermined feature functions that combine linearly to produce the approximation function.
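A minimal sketch of the \( p=1 \) case (LASSO), assuming scikit-learn; the synthetic features and the value of \(\alpha\) are illustrative.
```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))              # 10 candidate feature functions
w_true = np.array([3.0, 0, 0, -2.0, 0, 0, 0, 0, 0, 0])
y = X @ w_true + 0.1 * rng.normal(size=200)

# `alpha` plays the role of the trade-off coefficient in L = L_MSE + alpha * L_count.
model = Lasso(alpha=0.1).fit(X, y)
print(np.round(model.coef_, 2))             # most coefficients driven exactly to zero
```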
11. Neural Network
- A structure on which machine learning can be performed.
11.1. Rectifiers
11.1.1. ReLU
- Rectified Linear Unit
\[ \mathrm{ReLU}(x) := \max (0,x). \]
11.1.2. Leaky ReLU
\[ \mathrm{LReLU}(x) := \max(x, \varepsilon x) \] where \( 0 < \varepsilon \ll 1. \)
11.1.3. ELU
- Exponential Linear Unit
\[ \mathrm{ELU}(x) := \begin{cases} x & x >0, \\ \alpha(e^x -1) & x\le 0 \end{cases} \] where \( \alpha \ge 0 \).
11.2. Softmax
The softmax function \( \sigma\colon \mathbb{R}^k \to (0,1)^k \), where \( k\ge 1 \), is defined componentwise by: \[ \sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^k e^{z_j}}. \]
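A minimal sketch of the rectifiers above together with a numerically stable softmax; the shift by the maximum is a standard stability trick assumed here, not stated above.
```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, eps=0.01):
    return np.maximum(x, eps * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softmax(z):
    z = z - np.max(z)                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), leaky_relu(z), elu(z))
print(softmax(z), softmax(z).sum())   # probabilities summing to 1
```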
11.3. Architecture
11.3.1. Multilayer Perceptron
- Researched in the '80s and '90s.
- But what is a neural network? | Deep learning chapter 1 - YouTube
- It is useful to have a rough concrete meaning of each layer (beforehand), to assess the network.
- Layer \(n\) Neurons
\[a_{j}^{(n)}=\sigma\left( b_{j}^{(n)} + \sum_{i}w_{ij}^{(n)}a_{i}^{(n-1)}\right)\]
- where \(\sigma: \mathbb{R} \to [0,1]\) is usually the sigmoid
function, aka logistic curve: \[
\sigma(x) = \frac{1}{1+e^{-x}}
\]
- It could be any other functions, such as ReLU.
- with matrix notation: \[ \mathbf{a}^{(n)} = \sigma\left(\mathbf{b}^{(n)}+\mathbf{W}^{(n)}\mathbf{a}^{(n-1)}\right) \]
- The entire neural network can be thought of as a function, as well as each neuron: \[ \mathbf{a}^{(-1)} = \mathbf{M}\left(\mathbf{a}^{(0)}\right) = \left(\dots\circ\mathbf{M}^{(2)}\circ\mathbf{M}^{(1)}\right)\left(\mathbf{a}^{(0)}\right) \] where \(\mathbf{a}^{(-1)}\) denotes the output layer.
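A minimal forward-pass sketch of the layer formula above; layer widths and random weights are illustrative.
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(a0, weights, biases):
    """Apply a^(n) = sigmoid(b^(n) + W^(n) a^(n-1)) layer by layer."""
    a = a0
    for W, b in zip(weights, biases):
        a = sigmoid(b + W @ a)
    return a

rng = np.random.default_rng(0)
sizes = [4, 8, 3]                                   # layer widths (illustrative)
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
print(mlp_forward(rng.normal(size=4), Ws, bs))      # 3 outputs in (0, 1)
```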
11.3.2. Recurrent Neural Network
- RNN
- Serial processing. Like for audio.
11.3.2.1. Long Short-Term Memory Network
- LSTM
- Deep layers of small recurrences.
11.3.3. Convolutional Neural Network
- Used for image processing
- Perform multiple 2D convolutions at each layer, to detect features.
11.3.4. Autoencoder
- Extract the core features and lift back to the original shape.
11.3.5. Transformer
- Self-Attention Layer: A dynamic layer that adapts its weights based on the context. It measures distances in the concept space.
- It is so named because it transforms the meaning of a sentence in the concept space.
- It is shallower but wider.
- Transformer Neural Networks Derived from Scratch - YouTube
- Calculate the attention between every pair of input vectors with an attention-head neural network, and use it as weights to reduce the value vectors down.
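A minimal single-head self-attention sketch; the separate query/key/value projections and the \(1/\sqrt{d}\) scaling are standard details assumed here, not stated above.
```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One attention head: pairwise attention weights over the inputs,
    used to form a weighted combination of the value vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled pairwise similarities
    return softmax(scores, axis=-1) @ V       # weighted reduction of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                  # 5 token embeddings of dimension 16
W = [0.1 * rng.normal(size=(16, 16)) for _ in range(3)]
print(self_attention(X, *W).shape)            # (5, 16)
```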
11.3.6. Auto-Regression
- Use the output as the input for the next inference.
11.3.7. Diffusion
When generating an image, the diffusion model performs Langevin sampling, specifically using the Euler-Maruyama method: \[ \mathbf{x}_{t+1} = \mathbf{x}_t + \frac{\epsilon}{2} F(\mathbf{x}_t) + \sqrt{2\epsilon} \mathbf{z}_t, \] in order to sample from the space of all images, where the score function \( F \) is approximated by the trained model \( M_{\theta} \).
The random noise \( \mathbf{z}_t \) serves two main purposes.
- Escaping the local maxima,
- Providing high frequency information.
The model only learns to remove high-frequency data, since it has to remove noise. Therefore a constant supply of high-frequency information is crucial for generating realistic images. Put another way, the noise alleviates the high correlation of contiguous pixels, producing a more realistic image.
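A minimal sketch of the Euler-Maruyama update above. The trained score model \( M_\theta \) is replaced by a hypothetical analytic score of a Gaussian centered at `mu`, so the loop produces a rough sample concentrated around that center; the step size and iteration count are illustrative.
```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([2.0, -1.0])
score = lambda x: -(x - mu)              # stand-in for F(x) = grad log p(x) of N(mu, I)

eps = 0.05
x = rng.normal(size=2)                   # start from pure noise
for t in range(1000):
    z = rng.normal(size=2)               # fresh high-frequency noise each step
    x = x + 0.5 * eps * score(x) + np.sqrt(2 * eps) * z

print(x)                                 # a rough sample concentrated around mu
```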
11.4. Learning
11.4.1. Quantification
11.4.1.1. Loss Function
- The wrongness of the model on a single data point
- \(\operatorname{square\,loss}(f(x_i|\theta),y_i) = (f(x_i|\theta)-y_i)^2\)
- For the case of vector output:
- \(\operatorname{square\,loss}(f(\mathbf{x}_{i}|\theta),\mathbf{y}_i) = |f(\mathbf{x}_i|\theta)-\mathbf{y}_i|^2\)
- See loss function.
11.4.1.2. Cost Function
- Error Function
- The wrongness of the model with the entire dataset.
- It can dramatically affect the resulting parameters of the model.
- Input: Weights and Biases, Parameters: Dataset, Output: A
Number.
- For a model, Input: Data point, Parameters: Weights&Biases, Output: Estimation.
- e.g. Mean Square Error: \[ MSE(\theta) = \operatorname*{avg}_i\left(\operatorname{square\,loss}(f(x_i|\theta),y_i)\right) \]
11.4.1.3. Objective Function
- A function that is to be optimized for a model.
- Cost function is one of them.
11.4.2. Gradient Descent
- Move in the direction of \(-\nabla C(\theta)\), repeatedly.
- In order to calculate the gradient, the activation must be continuous and differentiable.
11.4.2.1. Stochastic Gradient Descent
- Calculating the cost on the entire dataset at every descent step is not practical, so we divide the data into batches and calculate on one batch each step.
- It is not the optimal descent direction, but it speeds up the computation.
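A minimal mini-batch SGD sketch on linear regression with squared loss; the dataset, batch size, and learning rate are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=1000)

theta = np.zeros(3)
lr, batch = 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        b = idx[start:start + batch]                  # one mini-batch
        grad = 2 * X[b].T @ (X[b] @ theta - y[b]) / len(b)
        theta -= lr * grad                            # one (noisy) descent step

print(np.round(theta, 2))                             # close to theta_true
```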
11.4.3. Backpropagation
- Method of simplifying calculation by calculating layer by layer, starting from the last layer.
- \[ \frac{\partial C_i}{\partial a_{k}^{(L-1)}} = \sum_j \frac{\partial C_i}{\partial a_j^{(L)}}\frac{\partial a_j^{(L)}}{\partial a_k^{(L-1)}} \]
- Note that the gradient is readily calculated at each layer, since you already have all the values that the gradients are evaluated at.
- This is an application of ((65747a84-db71-4b38-aaf9-a357720b2aa9)), which enables computing multiple partial derivatives at the same time.
- e.g. \[ \frac{\partial C_i}{\partial a_{k}^{(L-1)}} = \sum_j \left[2(a_j^{(L)}-y_j)\right]\left[w_{jk}^{(L)}\sigma'\!\!\left( z^{(L)}_j\right)\right] \]
- where \(z^{(L)}_j = b_{j}^{(L)} + \sum_k w_{jk}^{(L)}a_{k}^{(L-1)}\).
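A minimal backpropagation sketch for one hidden layer with sigmoid activations and squared loss, following the formulas above; the layer sizes and random data are illustrative.
```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
a0, y = rng.normal(size=4), rng.random(3)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)

# Forward pass: keep z and a of every layer for reuse in the backward pass.
z1 = b1 + W1 @ a0; a1 = sigmoid(z1)
z2 = b2 + W2 @ a1; a2 = sigmoid(z2)

# Backward pass, last layer first.
dC_da2 = 2 * (a2 - y)                         # dC/da^(L)
delta2 = dC_da2 * a2 * (1 - a2)               # dC/dz^(L); sigma'(z) = a(1 - a)
dC_dW2 = np.outer(delta2, a1)                 # gradients for W^(L) (and delta2 for b^(L))
dC_da1 = W2.T @ delta2                        # chain-rule sum over j
delta1 = dC_da1 * a1 * (1 - a1)
dC_dW1 = np.outer(delta1, a0)

print(dC_dW2.shape, dC_dW1.shape)             # (3, 5) (5, 4)
```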
11.5. History
- The 35 Year History of LLMs - YouTube
- Ilya - OpenAI
- RNN trained on Amazon reviews.
- Sentiment Neuron was found
- GPT-1, GPT-2
- Transformer trained on various books and the data from the web.
- GPT-3
- A hundred times bigger than the previous version, with 175 billion parameters and 96 layers, and a context window of around a thousand words.
- In-context learning
- 'prompt is the program'
- Through reinforcement learning, InstructGPT was later commercialized as ChatGPT in November 2022.
- LLM became the kernel system for the other tasks.
- Language processing is related to consciousness?
- GPT-4, Gemini, Bard, Grok in 2023
11.6. See Also
12. Reinforcement Learning
12.1. Markov Decision Process
- Variant of Markov chain
12.1.1. Definition
12.1.1.1. Chain
- It is a discrete stochastic process in which an agent takes actions \(A_t\) based on the current state \(S_t\), to get to a new state \(S_{t+1}\) and receive a reward \(R_{t+1}\).
- The space of states is denoted as \(\mathcal{S}\), the spaces of
actions at a state \(s\) as \(\mathcal{A}(s)\), and the space of
rewards as \(\mathcal{R} \subset \mathbb{R}\).
- \(\mathcal{S}^+\) is used when it contains terminal states.
- Tabular
- The process is tabular if all states and actions can be listed.
12.1.1.2. Dynamics
- The dynamics of the agent-environment interaction is given by the probability distribution: \[ p(s', r |s, a) := \mathrm{P}[S_{t+1}=s', R_{t+1}=r\mid S_t = s, A_t = a]. \]
- Complete Knowledge
- If the dynamics are known, one is said to have complete knowledge.
- Markov Property
- The probability only depends on the current state and action.
12.1.1.3. Policy
- The policy is the probability of each action (or a function giving an action) at each state; it is what needs to be optimized: \[ \pi(a|s) \quad\text{or}\quad \pi(s) = a. \]
12.1.1.4. Return
- Return is the weighted sum of the future rewards: \[ G_t := \sum_{k=t+1}^\infty \gamma^{k-t-1}R_k \]
- where \(\gamma\) is often a value between 0 and 1.
- In the episodic case the process terminates, marking the end of an episode and of the summation, in contrast to the continuing case.
12.1.1.5. Trajectory
- Sequence, possibly infinite, of states, actions, rewards produced by the process.
- It can be expressed as random variables, \[ S_0, A_0, R_1, S_1, A_1, R_2,\dots, R_T,S_T \]
- or as values, \[ s_0^m, a_0^m, r_1^m, s_1^m, a_1^m,\dots, r_T^m, s_T^m. \]
12.1.1.6. State Value Function
- Expected return at state \(s\), that varies depending on the policy \(\pi\).
- \[ v_\pi(s) := \mathrm{E}_\pi[G_t\mid S_t = s]. \]
- It is typically impossible to find it directly.
- Terminal states have a value of zero by definition.
12.1.1.7. Action Value Function
- Expected return of a state-action pair \((s, a)\). It also depends on \(\pi\).
- \[ q_\pi(s,a) := \mathrm{E}_\pi[G_t\mid S_t = s, A_t = a] \]
12.1.2. Bellman Equations
For any policy \(\pi\),
\begin{align*} v_\pi(s) &= \sum_{a\in \mathcal{A}(s)}\pi(a|s)q_\pi(s,a)\\ q_\pi(s,a) &= \sum_{s'\in\mathcal{S}, r\in\mathcal{R}}p(s',r|s,a)[r+\gamma v_\pi(s')] \end{align*}
- This is due to the fact that \[ \mathrm{E}_\pi[G_{t+1}\mid s, a] = \mathrm{E}_\pi[v_\pi(S_{t+1})\mid s,a] \] by the law of total expectation.
Consequently,
\begin{align*} v_\pi(s) &= \sum_{a\in \mathcal{A}(s)}\pi(a | s)\sum_{s'\in \mathcal{S}, r\in \mathcal{R}} p(s', r |s, a)(r+\gamma v_\pi(s'))\\ q_\pi(s,a) &= \sum_{s'\in\mathcal{S}, r\in\mathcal{R}}p(s',r|s,a)\left[r+\gamma \sum_{a'\in \mathcal{A}(s')}\pi(a'|s')q_\pi(s',a')\right] \end{align*}
12.1.3. Optimization
- The goal is to find the optimal policy \(\pi_*\) which satisfies:
- \[ \forall \pi, \forall s \in \mathcal{S}, v_*(s) := v_{\pi_*}(s) \ge v_\pi(s), \]
- \[ \forall \pi, \forall s\in \mathcal{S}, \forall a \in \mathcal{A}(s), q_*(s,a) := q_{\pi_*}(s,a) \ge q_\pi(s,a). \]
- The existence of such policy is guaranteed.
12.1.3.1. Bellman Optimality
- \[ v_*(s) = \max_{a\in\mathcal{A}(s)} q_*(s, a). \]
- This must be the case, since otherwise it is not optimal.
- There could be multiple actions values that are equal, in that case there's multiple optimal policy.
12.2. Generalized Policy Iteration
- GPI
Every reinforcement learning can be seen as a generalized policy iteration
\[ \pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} \dots \xrightarrow{I} \pi_* \xrightarrow{E} v_* \]
12.2.1. Policy Evaluation
12.2.1.1. Model-Based
- The approximate value of the state value function and action value function can be calculated by repeatedly calculating the value at each point \(V(s)\), \(Q(s,a)\) according to the Bellman equations.
- It is guaranteed to converge.
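A minimal iterative policy evaluation sketch on a made-up two-state MDP with a terminal state; the dynamics, rewards, discount, and policy are illustrative only.
```python
import numpy as np

# dynamics[s][a] = (next_state, reward); state 2 is terminal.
dynamics = {0: {0: (1, 0.0), 1: (2, 1.0)},
            1: {0: (0, 0.0), 1: (2, 2.0)}}
policy = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}   # equiprobable pi(a|s)
gamma = 0.9

V = np.zeros(3)                        # V[2] stays 0 (terminal state)
for _ in range(100):                   # repeated Bellman-equation sweeps
    for s in dynamics:
        total = 0.0
        for a, p in policy[s].items():
            s2, r = dynamics[s][a]
            total += p * (r + gamma * V[s2])
        V[s] = total

print(np.round(V, 3))                  # V for states 0 and 1; terminal stays 0
```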
12.2.1.2. Value Iteration
- The value does not need to reach convergence before the policy is improved.
12.2.2. Model-Free
- Since the model is unknown, estimating the action value is more feasible.
12.2.2.1. Monte Carlo Evaluation
- See Monte Carlo method.
- Estimate the state value function of the augmented state \(s := (\tilde{s}, \tilde{a})\), a state-action pair, within the Markov reward process: \(V(s) = Q(\tilde{s},\tilde{a})\).
- Given \(M\) trajectory samples of an MRP, it can be
approximated by the average: \[
\mathrm{E}_\pi[G_t\mid S_t = s] \approx V(s) = Q(\tilde{s},\tilde{a}) := \frac{1}{C(s)}\sum_{m=1}^M\sum_{\tau = 0}^{T_m-1}\mathbf{1}[s_\tau^m = s]g_\tau^m
\]
- where \(C(s)\) is the counting function of the occurrences of \(s\)
- It can be divided into updates after each trajectory
following the update rule: \[
Q(\tilde{s}_t^m, \tilde{a}_t^m) = V(s_t^m) \leftarrow V(s_t^m) + \frac{1}{C(s_t^m)}(g_t^m - V(s_t^m))
\]
- here \(C(s_t^m)\) only counts up to the \(m\)th trajectory, up to time \(t\) within the last trajectory.
- \(g_t^m\) for each \(s_t^m\) are determined after the \(m\)th trajectory terminates. Then, the states within that trajectory would be updated.
- It correctly calculates the average at each iteration.
- \[ \frac{(C(s_t^m)-1)V(s_t^m) + g_t^m}{C(s_t^m)} \]
- Target is what the estimates are moved towards at each update.
12.2.2.2. Constant-α Monte Carlo
- The update rule is modified to be \[ V(s_t^m) \leftarrow V(s_t^m) + \alpha (g_t^m - V(s_t^m)) \]
- where the stepsize \(\alpha\in (0,1]\) is a constant.
- If \(\alpha\) is small, it converges slowly, if \(\alpha\) is large, the estimate is noisy. The best hyperparameter \(\alpha\) is problem dependent.
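A minimal sketch of the constant-α update applied after one finished episode; the function name and data layout are illustrative assumptions.
```python
def constant_alpha_mc_update(Q, episode_returns, alpha=0.1):
    """episode_returns: list of ((state, action), g) pairs from one finished episode."""
    for (s, a), g in episode_returns:
        old = Q.get((s, a), 0.0)
        Q[(s, a)] = old + alpha * (g - old)   # move the estimate toward the target g
    return Q

Q = {}
Q = constant_alpha_mc_update(Q, [(("s0", "a1"), 1.0), (("s1", "a0"), 2.0)])
print(Q)   # each estimate moved a step of size alpha toward its observed return
```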
12.2.2.3. Off-Policy Method
In on-policy methods, the behavior policy \(b\) is equal to the target policy \(\pi\): \(b(a|s) = \pi(a|s)\). But in off-policy methods \(b\neq\pi\), so importance sampling is required: \[ q_\pi(s,a) = \mathrm{E}_\pi[G_t\mid S_t = s, A_t = a] = \mathrm{E}_b\left[\rho G_t\mid S_t = s, A_t = a\right] \] where \(\rho\) is the product of the probability ratios over the remainder of the trajectory: \[ \rho := \frac{p_\pi(G_t)}{p_b(G_t)} = \prod_{\tau = t+1}^{T-1}\frac{\pi(A_\tau \mid S_\tau)}{b(A_\tau\mid S_\tau)}. \]
- The plain \(g_{t,b}^m\) is multiplied by \(\rho_t^m\) to obtain \(g_t^m\) at each update phase.
- Coverage is required \[
\pi(a|s) > 0 \implies b(a|s) > 0
\]
- Behavior policy must be able to take all the actions that target policy takes.
- This ensures that there's always data in the limiting case.
- It explores more, but has higher variance.
12.2.2.4. n-Step Temporal Difference Learning
- n-Step TD
If an episode takes a significant amount of time, or even infinite time, the process of learning is slow. So one needs to learn during an episode.
12.2.2.4.1. Target
The target \(g_t^m\) is modified to: \[ g_{t;t+n}^m = \sum_{\tau = t+1}^{t+n < T}\left(\gamma^{\tau-t-1}r_\tau^m\right) + \gamma^n V(s_{t+n}^m). \] If \(t+n \ge T\), it is \(g_{t;t+n}^m = g_t^m\).
This allows one to compute the return with only the knowledge of \(n\) steps forward into the future. Using a value estimate in the target is called bootstrapping.
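A minimal sketch of the n-step target above; the list layout (`rewards[tau]` holding \(r_\tau\)) and the example values are illustrative conventions.
```python
def n_step_target(rewards, t, n, V, states, gamma, T):
    """n-step TD target g_{t:t+n}; falls back to the full return when t + n >= T."""
    horizon = min(t + n, T)
    g = sum(gamma ** (tau - t - 1) * rewards[tau] for tau in range(t + 1, horizon + 1))
    if t + n < T:
        g += gamma ** n * V[states[t + n]]     # bootstrap with the value estimate
    return g

rewards = [None, 0.0, 0.0, 1.0]                # r_1, r_2, r_3 with T = 3
states = ["s0", "s1", "s2", "s3"]
V = {"s2": 0.5}
print(n_step_target(rewards, t=0, n=2, V=V, states=states, gamma=0.9, T=3))  # 0.405
```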
12.2.2.4.2. Hyperparameter
- \(n\) is a hyperparameter
- when \(n=\infty\) TD is identical to MC.
- \(\alpha\) can be larger compared to MC, when \(n\) is small. It converges faster without much noise.
- This is because the variance is reduced by bootstrapping, therefore larger steps can be taken.
- For small \(n\), typically slightly higher \(\alpha\) performs better.
- For larger \(n\), one learns more from a single episode,
since signals from farther away can be considered when
\(n\) is large.
- So the exploration of the state space is fast, yet noisier.
12.2.2.4.3. Properties
- Batch training is performing the training on a fixed batch of collected data, unlike training on real-time data.
- MC and TD arrive at different estimates under batch
training, MC finds the one that minimize mean-squared error,
which is achieved by average, and TD finds the one with the
maximum-likelihood of the MRP \(\hat{p}(s', r|s)\).
- Using \(\hat{p}(s',r|s) = C(s',r|s)/C(s)\) to estimate \(v\) is equivalent to 1-step batch TD.
TD's criterion is better for predicting returns.
- Because it more directly uses the assumption of the MRP.
12.2.3. Policy Improvement
- The optimal deterministic policy \(\pi_*\): \[ \pi_*(s) = \operatorname*{argmax}_a q_*(s,a), \]
- Therefore, the approximation of the optimal policy is chosen greedily: \[ \pi'(s) = \operatorname*{argmax}_a q_\pi(s,a). \]
12.2.3.1. Policy Improvement Theorem
- \[ \forall s\in \mathcal{S}, v_\pi(s) \le v_{\pi'}(s) \]
- If \(\pi\) is greedy with respect to its own value function \(v_\pi\), then \(\pi = \pi_*\).
12.2.3.2. Monte Carlo Control
- By the constant-α Monte Carlo the Q-table is obtained. The policy can be improved greedily with respect to the Q-table: \[ \pi'(s) = \operatorname*{argmax}_aQ(s,a). \]
12.2.3.2.1. Exploration-Exploitation Trade-Off
- To discover optimal policies, all state-action pairs must be explored, but to get high returns, one must exploit known high value pairs.
- If the policy does not explore the unexplored state, the policy might never be able to discover the state with high state value.
- Therefore, the policy needs to be soft: \[
\forall s\in \mathcal{S}, \forall a\in \mathcal{A}(s), \pi(a|s) > 0.
\]
- with infinite data, \(\pi_*\) is always discoverable, if the policies are chosen to be soft.
12.2.3.2.2. ε-Greedy Policy
- With probability \(\epsilon\), take an action selected uniformly from \(\mathcal{A}(s)\); otherwise select the action with the maximum \(Q\) value.
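A minimal ε-greedy selection sketch; the Q-table layout and the state/action names are illustrative.
```python
import numpy as np

def epsilon_greedy(Q, state, actions, epsilon, rng):
    """With probability epsilon act uniformly at random, otherwise take argmax Q."""
    if rng.random() < epsilon:
        return rng.choice(actions)                            # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0)) # exploit

rng = np.random.default_rng(0)
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
picks = [epsilon_greedy(Q, "s0", ["left", "right"], 0.1, rng) for _ in range(1000)]
print(picks.count("right") / 1000)   # roughly 0.95 = (1 - 0.1) + 0.1 / 2
```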
12.3. Temporal Difference Control
- As long as each process does its job partially, the algorithm will converge toward the optimum.
12.3.1. n-Step Sarsa
- On Policy TD Control
- Temporal difference learning is used, and the updates are applied during the episodes.
- "Sarsa" comes from 1-step Sarsa, in which every update operates on \(s_t^m, a_t^m, r_{t+1}^m, s_{t+1}^m, a_{t+1}^m\).
12.3.2. Q-Learning
- n-Step Q-Learning
12.3.2.1. Target
- The target is adjusted to:
- \[
g_{t;t+n}^m = \sum_{\tau = t+1}^{t+n < T}\left(\gamma^{\tau-t-1}r_\tau^m\right) + \gamma^n \max_aQ(s_{t+n}^m, a).
\]
- If \(t+n \ge T\), it is \(g_{t;t+n}^m = g_t^m\).
- Assume the best case scenario at the last state.
- The max operator means this is off-policy, since the return is not according to the policy.
- Under the behavior policy, \(Q\) is targeting \(q_*\). It is
following the policy \(\pi_\text{greedy}\) defined to choose the
maximum action value.
- In contrast to the on-policy methods, in which we are approximating the optimal policy with the previous best behavior policy.
- The updates can happen before the next action is taken, unlike Sarsa (on-policy TD), in which updates happen after the next action is chosen.
12.3.2.2. Properties
- With the maximum operator the creeping in of nearby low values is prevented, as presented in the example of cliff walking.
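A minimal 1-step Q-learning sketch. The toy chain environment, ε-greedy behavior policy, and all hyperparameters are illustrative assumptions, not from the note.
```python
import numpy as np

class ChainEnv:
    """Walk on positions 0..3; reaching 3 ends the episode with reward 1."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):                  # action: -1 (left) or +1 (right)
        self.pos = max(0, self.pos + action)
        done = self.pos == 3
        return self.pos, (1.0 if done else 0.0), done

def greedy(Q, s, actions):
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

env, Q, actions = ChainEnv(), {}, [-1, +1]
rng, alpha, gamma, eps = np.random.default_rng(0), 0.1, 0.9, 0.1
for episode in range(200):
    s, done = env.reset(), False
    while not done:
        a = rng.choice(actions) if rng.random() < eps else greedy(Q, s, actions)
        s2, r, done = env.step(a)
        best_next = 0.0 if done else max(Q.get((s2, b), 0.0) for b in actions)
        target = r + gamma * best_next                  # off-policy max target
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
        s = s2

print({k: round(v, 2) for k, v in Q.items()})           # 'right' (+1) valued higher
```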
12.3.3. Expected Sarsa
12.3.3.1. Target
- The target is adjusted to:
- \[
g_{t;t+n}^m = \sum_{\tau = t+1}^{t+n < T}\left(\gamma^{\tau-t-1}r_\tau^m\right) + \gamma^n \sum_a \pi(a|s_{t+n}^m)Q(s_{t+n}^m, a).
\]
- If \(t+n \ge T\), it is \(g_{t;t+n}^m = g_t^m\).
- Expected return at the last state.
- It is less noisy, but the computation is more expensive.
- This is on-policy method, if \(b = \pi\).
- Expected Sarsa is a more efficient version of Sarsa: Sarsa estimates \(Q\) only by sampling, while Expected Sarsa also uses the expected value.
12.4. Function Approximation
12.4.1. Generalization
If an agent only visits a minuscule subset of all states, how should it generalize that experience to select high reward actions over the remaining states?
- Supervised Learning
- Learn from noisy samples.
- SL applied for RL is called function approximation.
12.4.2. On Policy Evaluation
- When \(\mathcal{S}\) is large, the degree of freedom is reduced by assuming: \[ v_\pi(s) \approx \hat{v}(s, \mathbf{w}) \]
- where \(\mathbf{w}\in \mathbb{R}^d\).
- Since a state \(s\) is arbitrary, there must exist features \(\mathbf{x} : \mathcal{S}\to \mathbb{R}^n\) that describe the state.
12.4.2.1. Value Error
- \[ \overline{\mathrm{VE}}_\pi(\mathbf{w}) = \sum_{s\in\mathcal{S}}\mu_\pi(s)\left[v_\pi(s) - \hat{v}(s,\mathbf{w})\right]^2 \] where \(\mu_\pi\) is a distribution over states according to the policy \(\pi\).
12.4.2.2. Stochastic Gradient Descent
- The goal is to find \(\hat{v}\) that approximately optimizes the value error.
12.4.2.2.1. Assume
- On policy: States are visited in proportion to \(\mu_\pi\).
- \(\hat{v}\) is differentiable with respect to \(\mathbf{w}\)
- The observation \(U_t\) is a surrogate for the target \(v_\pi(S_t)\)
12.4.2.2.2. Update Rule
- \[ \mathbf{w} \leftarrow \mathbf{w} + \alpha[U_t - \hat{v}(S_t, \mathbf{w})]\nabla\big|^\mathbf{w} \hat{v}(S_t, \mathbf{w}) \]
- If the target \(U_t\) is unbiased, then \(\mathbf{w}\) will converge to a local optimum of value error.
12.4.2.3. Gradient Monte Carlo
- MC with modified update rule.
12.4.2.3.1. Target
- \(U_t = G_t\)
12.4.2.3.2. Properties
- It may be significantly slow, since it requires an episode to end before it updates.
- If the value estimator function is a linear value function of
the form \[
\hat{v}(s,\mathbf{w}) = \mathbf{w}^\top\mathbf{x}(s)
\] where \(\mathbf{x}(s)\) is a feature vector, the local
minimum \(\mathbf{w}\) is the global minimum \(\mathbf{w}\).
- Since the linearity makes the optimization convex.
12.4.2.4. Semi-Gradient Temporal Difference
- Semi-Gradient TD
12.4.2.4.1. Target
- \[ U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) \]
- It's biased. Since \[ \mathrm{E}[\hat{v}(S_{t+1}, \mathbf{w})|S_t = s] \neq \mathrm{E}[v_\pi(S_{t+1}) | S_t = s] \implies \mathrm{E}[U_t| S_t=s] \neq v_\pi(s) \]
12.4.2.4.2. Properties
- Update rule is not a true gradient step, because \(U_t\)
depends on \(\hat{v}\) which depends on varying
\(\mathbf{w}\).
- \[ \nabla (\overline{\mathrm{VE}}_\pi(\mathbf{w}, U_t)) \neq (\nabla \overline{\mathrm{VE}}_\pi)(\mathbf{w}, U_t) \] therefore following the gradient does not maximally minimize the value error.
- And \(\mathrm{E}[\hat{v}(S_t, \mathbf{w})] \neq \mathrm{E}[v_\pi(S_t)]\) either, even for the VE optimum \(\hat{v}\).
- That means we don't get guaranteed convergence.
- In 1-step case, for the linear value function, \(\mathbf{w}\)
will converge to a TD fixed point \(\mathbf{w}_\text{TD}\).
- The closeness to the VE optimum is related to the discount factor \(\gamma\). For high \(\gamma\) near 1, the solution can be far from the true optimum.
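A minimal semi-gradient TD(0) sketch with a linear value function \(\hat{v}(s,\mathbf{w}) = \mathbf{w}^\top\mathbf{x}(s)\); the two-state episodes, one-hot feature map, and step size are illustrative assumptions.
```python
import numpy as np

def td0_linear(episodes, x, w, alpha=0.1, gamma=1.0):
    for episode in episodes:                      # episode: [(s, r, s_next, done), ...]
        for s, r, s_next, done in episode:
            v = w @ x(s)
            v_next = 0.0 if done else w @ x(s_next)
            target = r + gamma * v_next           # bootstrapped (biased) target U_t
            w += alpha * (target - v) * x(s)      # gradient of v_hat only: semi-gradient
    return w

x = lambda s: np.eye(2)[s]                        # one-hot features for states 0 and 1
episodes = [[(0, 0.0, 1, False), (1, 1.0, 1, True)]] * 50
print(td0_linear(episodes, x, np.zeros(2)))       # approaches [1.0, 1.0]
```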
12.4.3. On Policy Control with Function Approximation
- Assumption
- Action space \(\mathcal{A}(s)\) is not large.
- Episodic case; the continuing case is different.
- The state value function \(\hat{v}(s, \mathbf{w})\) is now action value function \(\hat{q}(s,a,\mathbf{w})\).
12.4.3.1. Semi-Gradient 1-Step Sarsa
12.4.3.1.1. Target
- \[ U_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) \]
- Normalized radial basis functions are used as the features for a state.
12.4.4. Off Policy Methods with Function Approximation
12.4.4.1. The Deadly Triad
- Function Approximation
- Off-Policy Training
- Bootstrapping
- \(\leadsto\) stability and divergence issues.
- The combination is often unavoidable in practice.
12.5. Policy Gradient Methods
- Policy Gradient Methods | Reinforcement Learning Part 6 - YouTube
- More direct approach to the problem of finding the optimal policy.
- The goal is to directly determine the parameter \(\bm{\theta} \in \mathbb{R}^{d'}\) that optimizes the policy \(\pi(a|s, \bm{\theta})\).
12.5.1. Reinforce Algorithm
- Monte Carlo algorithm
- Update happens after a full episode is performed.
- Assume
- Policy \(\pi(a|s, \bm{\theta})\) is differentiable
- Initial setup
- Functional form of \(\pi\)
- Initial \(\bm{\theta}\)
- Step size \(\alpha\)
- The softmax function is often used so that the parameters \(\theta_i\) can remain unconstrained.
- The policy at a point is obtained by averaging the parameters weighted by normalized radial basis functions, and then applying the softmax.
- It is not exactly linear, but good enough.
- Update Rule
- \[ \bm{\theta} \leftarrow \bm{\theta} + \alpha g_t^m \nabla\ln\pi(a_t^m|s_t^m, \bm{\theta}) \]
- This reinforces the probability of taking action \(a_t^m\) when in state \(s_t^m\) in proportion to the return \(g_t^m\).
- \(\ln\) is there to compensate for the effect of selection frequency on the descent.
- A high selection rate is exactly canceled by dividing the step by the probability.
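A minimal REINFORCE update sketch with a tabular softmax policy (a simpler parameterization than the radial-basis one described above, chosen for illustration); the episode layout and step size are assumptions.
```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=1.0):
    """episode: list of (state, action, reward-received-after-the-action) tuples."""
    T = len(episode)
    for t, (s, a, _) in enumerate(episode):
        g = sum(gamma ** (k - t) * episode[k][2] for k in range(t, T))  # return g_t
        pi = softmax(theta[s])
        grad_log_pi = -pi                        # d/dtheta[s, :] of ln pi(a|s)
        grad_log_pi[a] += 1.0
        theta[s] += alpha * g * grad_log_pi      # reinforce a_t in proportion to g_t
    return theta

theta = np.zeros((2, 2))                         # 2 states x 2 actions
episode = [(0, 1, 0.0), (1, 0, 1.0)]
print(np.round(reinforce_update(theta, episode), 3))
```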
12.5.2. Reinforce with Baseline
- If \(\forall g_t^m \gg 1\), the optimization will take quite long,
because the Reinforce algorithm will never learn to avoid actions,
only to do good things more often.
- Replace \(g_t^m\) with \(g_t^m - b(s_t^m)\), where \(b\) is called the baseline.
- Baseline can be any function that does not depend on action and does not make the problem diverge.
- Natural choice of the baseline is the state value, the expected
return of that state:
- \[ b(s_t^m) = \hat{v}(s_t^m, \mathbf{w}) \]
- The agent would prefer the action that increases the expected return.
12.5.2.1. Update Rule
- It is exactly the combination of ((66865702-a5f4-4c57-a58b-0a182067473b)) and Reinforce algorithm.
12.5.3. Policy Gradient Theorem
- The objective to optimize in the policy gradient method is
- Episodic case: \[ v_{\pi_{\bm{\theta}}}(s_0) = \mathrm{E}_{\pi_{\bm{\theta}}}[G_0|S_0 = s_0] \]
- Theorem
- \[
\nabla\big|^{\bm{\theta}} v_{\pi_{\bm{\theta}}}(s_0) \propto \sum_{s\in \mathcal{S}}\mu_{\pi_{\bm{\theta}}}(s)\sum_{a\in \mathcal{A}(s)}q_{\pi_{\bm{\theta}}}(s, a)\nabla\big|^{\bm{\theta}}\pi(a|s, \bm{\theta})
\]
- The gradient of the state value of the initial state, i.e. the gradient of the total return is proportional to the action value weighted average of the gradient of policy at each state-action pair.
If an algorithm approximates this direction \(\nabla\big|^{\bm{\theta}}v_{\pi_{\bm{\theta}}}\), it'll approximately optimize the objective.
12.5.4. Properties
- PGM may be slow to converge because it requires smooth transitions to converge, but because of that it is more likely to eventually converge without getting stuck.
- It solves only for the policy.
- No detailed knowledge is gained.
- It bypasses the complexity.
- It can learn stochastic policies.
- The level of stochasticity can be learned!
- On the other hand, \(\epsilon\)-greedy method only has a fixed stochasticity.
- It avoids computationally intensive argmax operations.
12.5.5. Examples
- Proximal Policy Optimization
13. Interpretation
13.1. Mechanistic Interpretability
- The embedding of the input tokens are modified as it goes through attention layer and multilayer perceptron. The sequence of modified embeddings is called the residual stream.
- The final row that determines the next token can be unembedded at any point within the residual stream. Often the softmax is used, which limits the model's confidence in the next token. When the unembedded tokens change drastically in quality, it can be suspected that the layer in which the change occurred controls that quality in the next token.
- A single neuron exhibits polysemanticity: a concept is expressed by a combination of neurons, so the neural network can learn more concepts than it has neurons.
- The final row can be transformed into a sparse concept/feature vector with larger length, using a sparse autoencoder.
- Rare features are harder to extract using the autoencoder.
- Cross-layer superposition is under active research.
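A minimal sparse-autoencoder sketch: a linear encoder/decoder with ReLU features trained with a reconstruction-plus-L1 loss. The dimensions, random data, and training loop are illustrative assumptions, not the setup used in the cited work.
```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict, n = 32, 128, 2000            # dictionary wider than the residual stream
X = rng.normal(size=(n, d_model))             # stand-in for residual-stream activations

W_enc = 0.1 * rng.normal(size=(d_model, d_dict))
W_dec = 0.1 * rng.normal(size=(d_dict, d_model))
lam, lr = 1e-3, 1e-2

for step in range(300):
    f = np.maximum(0.0, X @ W_enc)            # sparse feature activations
    err = f @ W_dec - X                       # reconstruction error
    # loss = (1/n) * sum ||err||^2 + (lam/n) * sum |f|
    grad_dec = f.T @ (2 * err / n)
    grad_f = (2 * err / n) @ W_dec.T + (lam / n) * np.sign(f)
    W_enc -= lr * X.T @ (grad_f * (f > 0))    # backpropagate through the ReLU
    W_dec -= lr * grad_dec

print(np.mean((np.maximum(0.0, X @ W_enc) @ W_dec - X) ** 2))  # reconstruction MSE
```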
14. History
- AlphaGo beats Lee Sedol in 2016.
14.1. Moore's Law
- AI's Version of Moore's Law? - Computerphile - YouTube
The power of AIs is growing exponentially, measured by the time it takes experts to complete the tasks on which an AI achieves an 80% success rate.
15. Products
15.1. Image Generation
- Imagen - Google
- Flux.1 - Black Forest Labs
- Midjourney
15.2. Language Model
- GPT - OpenAI
- ChatGPT
- Grok - X
- Llama - Meta
- PaLM - Google
16. Tools
- Python Libraries: e.g., TensorFlow
- Regression produces a model for a continuous variable, and classification for a discrete variable.
17. Applications
- Nonlinear coordinate transformation using autoencoder
- A sparse approximation of the nonlinear dynamics, as a linear combination of candidate feature functions, can be found using SINDy
- SINDy uses the sparse regression loss function described above (see Sparse Regression)
18. Reference
- Rectifier (neural networks) - Wikipedia
- Softmax function - Wikipedia
- Machine learning - Wikipedia
- The Diffusion Model's Unsung Sidekick: A Science of Solving (Almost) Any Prob…
- Why Does Diffusion Work Better than Auto-Regression? - YouTube
- Stochastic gradient descent - Wikipedia
- Learning rate - Wikipedia
- How AI Learns Concepts - YouTube
- How Computers Derive Equations - YouTube
- Reinforcement Learning By the Book - YouTube
- [1] R. Sutton and A. Barto. Reinforcement learning: An Introduction (2nd Ed). MIT Press, 2018.
- [2] H. Hasselt, et al. RL Lecture Series, Deepmind and UCL, 2021, DeepMind x UCL RL Lecture Series - Introduction to Reinforcement Learning
- [3] J. Achiam. Spinning Up in Deep Reinforcement Learning, OpenAI, 2018
- [4] J. Achiam, Spinning Up in Deep Reinforcement Learning: Intro to Policy Optimization, OpenAI, 2018, Part 3: Intro to Policy Optimization — Spinning Up documentation
- [5] D. Silver, Lecture 7: Policy Gradient Methods, Deepmind, 2015, RL Course by David Silver - Lecture 7: Policy Gradient Methods - YouTube
- The Dark Matter of AI {Mechanistic Interpretability} - YouTube
- Deep Learning to Discover Coordinates for Dynamics: Autoencoders & Physics In…